kNN graph


Incorporating Fairness in Neighborhood Graphs for Fair Spectral Clustering

Moorthy, Adithya K, Saradhi, V Vijaya, Prasad, Bhanu

arXiv.org Artificial Intelligence

Abstract--Graph clustering plays a pivotal role in unsupervised learning methods like spectral clustering, yet traditional methods for graph clustering often perpetuate bias through unfair graph constructions that may underrepresent some groups. The current research introduces novel approaches for constructing fair k-nearest neighbor (kNN) and fair ϵ-neighborhood graphs that proactively enforce demographic parity during graph formation. By incorporating fairness constraints at the earliest stage of neighborhood selection, our approaches build proportional representation of sensitive features into the local graph structure while maintaining geometric consistency. Our work addresses a critical gap in pre-processing for fair spectral clustering, demonstrating that topological fairness in graph construction is essential for achieving equitable clustering outcomes. Widely used graph construction methods like kNN and ϵ-neighborhood graphs propagate edge-based disparate impact on sensitive groups, leading to biased clustering results. Providing representation of each sensitive group in the neighborhood of every node leads to fairer spectral clustering results because the topological features of the graph naturally reflect equitable group ratios. This research fills an essential gap in fair unsupervised learning by illustrating how topological fairness in graph construction inherently facilitates fairer spectral clustering results without the need for changes to the clustering algorithm itself. Thorough experiments on three synthetic datasets, seven real-world tabular datasets, and three real-world image datasets show that our fair graph construction methods surpass the current baselines in graph clustering tasks. Machine learning algorithms are widely used for decision-making in a variety of fields, including criminal justice [1], healthcare [2], [3], and finance [4]. The reason for this is that these algorithms have been shown to be very accurate and effective at analyzing big datasets. The increasing prevalence of these algorithms has raised questions regarding their fairness and potential to reinforce societal biases [5], [6]. These biases can result in unfair treatment of certain groups of people, thereby creating significant societal implications. Recently, concerns have been raised about the fairness of clusters produced by popular clustering algorithms.
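The core idea, giving every sensitive group a quota in each node's neighbor list, can be sketched in a few lines. This is a minimal illustration assuming a proportional quota rule and Euclidean distances; `fair_knn_edges` and its quota formula are our own toy construction, not the paper's exact algorithm:

```python
from collections import Counter

def fair_knn_edges(points, groups, k):
    """Toy fair-kNN construction: every node draws neighbors from each
    sensitive group in proportion to that group's share of the data.
    Illustrative sketch only, not the paper's exact algorithm."""
    n = len(points)
    counts = Counter(groups)
    # Per-group quota, proportional to group size (at least one each).
    quotas = {g: max(1, round(k * c / n)) for g, c in counts.items()}
    edges = set()
    for i, p in enumerate(points):
        for g, q in quotas.items():
            # The q nearest neighbors of i restricted to group g.
            cand = sorted(
                (j for j in range(n) if j != i and groups[j] == g),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
            )
            for j in cand[:q]:
                edges.add((min(i, j), max(i, j)))
    return edges
```

With two equally sized groups and k = 2, every node ends up adjacent to at least one member of each group, which is the topological-fairness property the abstract describes.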


Towards Robust Graph Structural Learning Beyond Homophily via Preserving Neighbor Similarity

Zhu, Yulin, Lai, Yuni, Ai, Xing, LO, Wai Lun, Li, Gaolei, Li, Jianhua, Tang, Di, Zhang, Xingxing, Yang, Mengpei, Zhou, Kai

arXiv.org Artificial Intelligence

Despite the tremendous success of graph-based learning systems in handling structural data, it has been widely shown that they are fragile to adversarial attacks on homophilic graph data, where adversaries maliciously modify the semantic and topology information of the raw graph data to degrade the predictive performances. Motivated by this, a series of robust models have been crafted to enhance the adversarial robustness of graph-based learning systems on homophilic graphs. However, the security of graph-based learning systems on heterophilic graphs remains a mystery to us. To bridge this gap, in this paper, we start to explore the vulnerability of graph-based learning systems regardless of the homophily degree, and theoretically prove that the update of the negative classification loss is negatively correlated with the pairwise similarities based on the powered aggregated neighbor features. This theoretical finding inspires us to craft a novel robust graph structural learning strategy that serves as a useful graph mining module in a robust model, incorporating a dual-kNN graph construction pipeline to supervise the neighbor-similarity-preserved propagation, where the graph convolutional layer adaptively smooths or discriminates the features of node pairs according to their rich local structures. In this way, the proposed methods can mine the "better" topology of the raw graph data under diverse graph homophily and achieve more reliable data management on homophilic and heterophilic graphs.
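The dual-kNN idea can be caricatured as two kNN graphs: one over raw node features and one over aggregated neighbor features. The helpers below (`aggregate_neighbors`, `knn_from_features`) are hypothetical stand-ins for the paper's pipeline, sketched under the assumption of mean-pooled neighbors and Euclidean distance:

```python
def aggregate_neighbors(features, adj):
    """Mean-pool each node's neighbor features (one propagation step)."""
    dim = len(features[0])
    agg = []
    for i, nbrs in enumerate(adj):
        if nbrs:
            agg.append(tuple(sum(features[j][d] for j in nbrs) / len(nbrs)
                             for d in range(dim)))
        else:
            agg.append(features[i])  # isolated node keeps its own features
    return agg

def knn_from_features(feats, k):
    """kNN neighbor lists by squared Euclidean distance."""
    n = len(feats)
    return [sorted((j for j in range(n) if j != i),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(feats[i], feats[j])))[:k]
            for i in range(n)]
```

A second kNN graph built on `aggregate_neighbors` output is then the "dual" view that can be compared against the raw-feature graph when deciding which edges to trust.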


Scalable Varied-Density Clustering via Graph Propagation

Pham, Ninh, Zheng, Yingtao, Phibbs, Hugo

arXiv.org Artificial Intelligence

We propose a novel perspective on varied-density clustering for high-dimensional data by framing it as a label propagation process in neighborhood graphs that adapt to local density variations. Our method formally connects density-based clustering with graph connectivity, enabling the use of efficient graph propagation techniques developed in network science. To ensure scalability, we introduce a density-aware neighborhood propagation algorithm and leverage advanced random projection methods to construct approximate neighborhood graphs. Our approach significantly reduces computational cost while preserving clustering quality. Empirically, it scales to datasets with millions of points in minutes and achieves competitive accuracy compared to existing baselines.
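A bare-bones version of the graph-propagation view: build a mutual-kNN graph and propagate cluster labels through its connected components. `mutual_knn_clusters` is our own illustrative stand-in; the paper's density-aware propagation and random-projection graph construction are considerably more involved:

```python
def mutual_knn_clusters(points, k):
    """Cluster by connected components of the mutual-kNN graph,
    a simple stand-in for density-aware neighborhood propagation."""
    n = len(points)
    def dist2(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist2(i, j))[:k]
           for i in range(n)]
    # Keep only mutual edges: i and j must each be in the other's kNN list.
    adj = [[j for j in knn[i] if i in knn[j]] for i in range(n)]
    labels, cur = [-1] * n, 0
    for s in range(n):
        if labels[s] != -1:
            continue
        stack, labels[s] = [s], cur
        while stack:  # propagate the label through the component
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cur
                    stack.append(v)
        cur += 1
    return labels
```

The mutual-edge filter is what makes the toy density-sensitive: points in sparse regions rarely appear in each other's kNN lists symmetrically, so spurious bridges between clusters are pruned.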


NOMAD Projection

Duderstadt, Brandon, Nussbaum, Zach, van der Maaten, Laurens

arXiv.org Artificial Intelligence

The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this paper, we introduce Negative Or Mean Affinity Discrimination (NOMAD) Projection, the first method for unstructured data visualization via nonlinear dimensionality reduction that can run on multiple GPUs at train time. We provide theory that situates NOMAD Projection as an approximate upper bound on the InfoNC-t-SNE loss, and empirical results that demonstrate NOMAD Projection's superior performance and speed profile compared to existing state-of-the-art methods. We demonstrate the scalability of NOMAD Projection by computing the first complete data map of Multilingual Wikipedia.

CVPR 2025 Tutorial - Identifying Structure in Data: All you need to know about Dimensionality Reduction, Clustering, and More

1. Introduction The discovery of neural scaling laws has resulted in an explosion in the size of datasets consumed and produced by AI models [11] [9]. Traditional algorithms for unstructured data visualization, such as t-SNE [14] and UMAP [15], have not kept up with the pace of dataset scaling. This presents a significant challenge for data-centric AI explainability, since it relies upon methods like t-SNE and UMAP for exploratory data analysis.


Scale-Free Graph-Language Models

Lu, Jianglin, Liu, Yixuan, Zhang, Yitian, Fu, Yun

arXiv.org Artificial Intelligence

Graph-language models (GLMs) have demonstrated great potential in graph-based semi-supervised learning. A typical GLM consists of two key stages: graph generation and text embedding, which are usually implemented by inferring a latent graph and finetuning a language model (LM), respectively. However, the former often relies on artificial assumptions about the underlying edge distribution, while the latter requires extensive data annotations. To tackle these challenges, this paper introduces a novel GLM that integrates graph generation and text embedding within a unified framework. For graph generation, we leverage the scale-free property of real-world citation networks as a structural prior. We unexpectedly find that this natural property can be effectively approximated by a simple k-nearest neighbor (KNN) graph. For text embedding, we develop a graph-based pseudo-labeler that utilizes scale-free graphs to provide complementary supervision for improved LM finetuning. Extensive experiments on representative datasets validate our findings on the scale-free structural approximation of KNN graphs and demonstrate the effectiveness of integrating graph generation and text embedding with a real structural prior. Recently, graph-language models (GLMs) have been widely explored in graph-based semi-supervised classification on documents, especially for citation networks (Qin et al., 2023; Yu et al., 2025; Lu et al., 2023; He et al., 2024). When designing a GLM for classification, two key challenges arise: graph generation (how to generate a reasonable graph structure for the given documents) and text embedding (how to encode the textual sequences into meaningful semantic features). To address these problems, various GLMs have been proposed, which can be broadly categorized into latent graph inference (LGI) models and language-assisted graph (LAG) models.
LGI models focus on graph generation and typically rely on feature engineering approaches, such as bag-of-words (Harris, 1954), TF-IDF (Aizawa, 2003), and skip-gram (Mikolov et al., 2013), to encode textual sequences into shallow representations.
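The hub effect behind scale-free-like kNN degree distributions is easy to see in a directed kNN graph: out-degree is fixed at k, but in-degree concentrates on central points. A brute-force sketch (`knn_in_degrees` is our own illustration, not the paper's construction):

```python
def knn_in_degrees(points, k):
    """Directed kNN graph: every node points to its k nearest neighbors.
    The *in*-degree distribution is the quantity that can look
    scale-free, since hub nodes attract many incoming edges."""
    n = len(points)
    indeg = [0] * n
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(points[i], points[j])))
        for j in order[:k]:
            indeg[j] += 1
    return indeg
```

On a central point surrounded by a ring of others, the center collects almost all incoming edges while the total in-degree always sums to n * k.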


Structure-Guided Input Graph for GNNs facing Heterophily

Tenorio, Victor M., Navarro, Madeline, Rey, Samuel, Segarra, Santiago, Marques, Antonio G.

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) have emerged as a promising tool to handle data exhibiting an irregular structure. However, most GNN architectures perform well on homophilic datasets, where the labels of neighboring nodes are likely to be the same. In recent years, an increasing body of work has been devoted to the development of GNN architectures for heterophilic datasets, where labels do not exhibit this low-pass behavior. In this work, we create a new graph in which nodes are connected if they share structural characteristics, which implies a higher chance of sharing labels, and then use this new graph in the GNN architecture. To do this, we compute the k-nearest neighbors graph according to distances between structural features, which are either (i) role-based, such as degree, or (ii) global, such as centrality measures. Experiments show that the labels are smoother in this newly defined graph and that the performance of GNN architectures improves when using this alternative structure.
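A minimal sketch of the rewiring step: compute simple structural features per node (here degree and triangle count, as toy stand-ins for the role-based and centrality features in the paper) and connect each node to the nodes whose features are closest:

```python
def structural_knn(adj, k):
    """Rewire a graph so nodes with similar *structural* roles are
    connected. Features are (degree, triangle count) -- a toy stand-in
    for the role-based / centrality features described in the paper."""
    n = len(adj)
    def triangles(i):
        nbrs = adj[i]
        # Count neighbor pairs that are themselves connected.
        return sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    feats = [(len(adj[i]), triangles(i)) for i in range(n)]
    new_adj = [sorted((j for j in range(n) if j != i),
                      key=lambda j: sum((x - y) ** 2
                                        for x, y in zip(feats[i], feats[j])))[:k]
               for i in range(n)]
    return feats, new_adj
```

In a triangle with one pendant node, the two degree-2 triangle members get identical features and therefore pick each other in the rewired graph, even though structural similarity, not spatial proximity, drives the edge.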


Reviews: A Theory-Based Evaluation of Nearest Neighbor Models Put Into Practice

Neural Information Processing Systems

SUMMARY: The paper studies the problem of testing whether a graph is ε-far from a kNN graph, where ε-far means that at least an ε-fraction of the edges need to be changed in order to make the graph a kNN graph. The paper presents an algorithm with an upper bound of O(√n · k²/ε²) edge/vertex queries and a lower bound of Ω(√n).

1. I guess "ω" should be "p".
2. The result of Proposition 12 is interesting, which bounds the number of points that can be in the kNN set of a particular point. The bound is k times the known bound for 1-NN. I wonder if this could be tightened somehow.
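The distance being tested can be made concrete with an exhaustive (and therefore non-sublinear) checker that measures the fraction of edges violating the kNN property; `knn_violation_fraction` is our own illustration, not the paper's query-efficient tester:

```python
def knn_violation_fraction(points, out_edges, k):
    """Fraction of directed edges (u -> v) where v is NOT among u's k
    nearest points. A crude exhaustive stand-in for the sublinear
    property tester discussed in the review."""
    n = len(points)
    def dist2(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    bad = total = 0
    for u, nbrs in enumerate(out_edges):
        true_knn = set(sorted((j for j in range(n) if j != u),
                              key=lambda j: dist2(u, j))[:k])
        for v in nbrs:
            total += 1
            bad += v not in true_knn
    return bad / total
```

A graph is ε-far in this simplified sense when the returned fraction exceeds ε; the paper's contribution is deciding this with only about √n queries rather than inspecting every edge.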


EggNet: An Evolving Graph-based Graph Attention Network for Particle Track Reconstruction

Calafiura, Paolo, Chan, Jay, Delabrouille, Loic, Wang, Brandon

arXiv.org Machine Learning

Track reconstruction is a crucial task in particle experiments and is traditionally very computationally expensive due to its combinatorial nature. Recently, graph neural networks (GNNs) have emerged as a promising approach that can improve scalability. Most of these GNN-based methods, including the edge classification (EC) and the object condensation (OC) approach, require an input graph that needs to be constructed beforehand. In this work, we consider a one-shot OC approach that reconstructs particle tracks directly from a set of hits (point cloud) by recursively applying graph attention networks with an evolving graph structure. This approach iteratively updates the graphs and can better facilitate the message passing across each graph. Preliminary studies on the TrackML dataset show better track performance compared to the methods that require a fixed input graph.
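The recompute-then-message-pass cycle can be sketched as a loop that alternately rebuilds a kNN graph from the current embeddings and smooths each embedding toward its neighbors' mean. `evolve_graph` is a hypothetical caricature of the evolving-graph idea, not EggNet's attention-based update:

```python
def evolve_graph(points, k, steps=3, lr=0.5):
    """Toy evolving-graph loop: (i) build a kNN graph from the current
    embeddings, (ii) move each embedding toward its neighbors' mean,
    then repeat, mimicking the recompute-then-message-pass cycle."""
    pts = [list(p) for p in points]
    n, dim = len(pts), len(points[0])
    knn = []
    for _ in range(steps):
        knn = [sorted((j for j in range(n) if j != i),
                      key=lambda j: sum((pts[i][d] - pts[j][d]) ** 2
                                        for d in range(dim)))[:k]
               for i in range(n)]
        # Blend each point with the mean of its current neighbors.
        pts = [[(1 - lr) * pts[i][d] + lr * sum(pts[j][d] for j in knn[i]) / k
                for d in range(dim)]
               for i in range(n)]
    return pts, knn
```

Hits belonging to the same (toy) track contract toward each other across iterations, so the final graph cleanly pairs them, which is the intuition behind letting the graph evolve with the embeddings.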


Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents

Dong, Yanfei, Deng, Lambert, Zhang, Jiazheng, Yu, Xiaodong, Lin, Ting, Gelli, Francesco, Poria, Soujanya, Lee, Wee Sun

arXiv.org Artificial Intelligence

Documents that consist of diverse templates and exhibit complex spatial structures pose a challenge for document entity classification. We propose KNN-former, which incorporates a new kind of spatial bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities. We limit entities' attention only to their local radius defined by the KNN graph. We also use combinatorial matching to address the one-to-one mapping property that exists in many documents, where one field has only one corresponding entity. Moreover, our method is highly parameter-efficient compared to existing approaches in terms of the number of trainable parameters. Despite this, experiments across various datasets show our method outperforms baselines in most entity types. Many real-world documents exhibit combinatorial properties which can be leveraged as inductive biases to improve extraction accuracy, but existing datasets do not cover these documents. To facilitate future research into these types of documents, we release a new ID document dataset that covers diverse templates and languages. We also release enhanced annotations for an existing dataset.
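The spatial bias can be sketched as a boolean attention mask derived from a kNN graph over entity coordinates; `knn_attention_mask` and `masked_softmax` are simplified stand-ins for KNN-former's attention calculation, assuming plain Euclidean distances between entity positions:

```python
import math

def knn_attention_mask(coords, k):
    """Boolean mask: entity i may attend only to itself and its k
    spatially nearest entities (a KNN-former-style locality bias,
    sketched from the abstract)."""
    n = len(coords)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(coords[i], coords[j])))
        mask[i][i] = True
        for j in order[:k]:
            mask[i][j] = True
    return mask

def masked_softmax(scores, mask_row):
    """Softmax over scores with masked-out positions forced to zero."""
    exps = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask_row)]
    z = sum(exps)
    return [e / z for e in exps]
```

Even a huge raw attention score toward a distant entity contributes nothing once masked, which is how the kNN radius keeps attention local.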